Machine Learning on Amazon Retail Data
  • Code
  • By Bhavana
  1. Data Prep / EDA
  • Home
  • Introduction
  • Data Prep / EDA
  • Models and Methods
    • Naive Bayes
    • Clustering
    • Decision Trees
    • Neural Networks
    • PCA and SVD
    • Regression
    • SVM (Support Vector Machine)
  • Conclusions

On this page

  • Data Collection
  • Data Cleaning
  • Data Preprocessing / Visualization
    • Investigation of products in both sets of data

Data Prep / EDA

Where the data source, processing, and visualization (EDA) is presented.

Data Collection

Amazon product information was scraped from the website using the API service ScraperAPI; this is because, as Amazon is a hugely popular website, they have many anti-scraping measures in place such as rate-limiting, IP blocking, dymamic loading, and such. Using the external API service, these limitations were able to be avoided. The search queries chosen to search for items were based on top 100 Amazon searches, found on this site and this site. An example of using the API, along with its core endpoint, is below.

import requests

payload = {
   'api_key': 'API_KEY',
   'query': 'iphone 15 charger',
   's': 'price-asc-rank'
}

response = requests.get('https://api.scraperapi.com/structured/amazon/search',
                        params=payload).json()

The jupyter notebook code for the web scraping can be found here.

Additionally, more data was used to supplement the existing data. Since the scraped data was only about 26K rows, a Kaggle dataset was used that contains more than one million rows, had around the same fields as the scraped data, and was also from the USA (many Amazon Kaggle datasets were from the non-US).

The raw data from both sources can be seen below in Table 1; the scraped raw data CSV can also be viewed here.

Table 1: The raw data from both datasets.
(a) The raw data scraped from Amazon using ScraperAPI
type position asin name image has_prime is_best_seller is_amazon_choice is_limited_deal stars total_reviews url availability_quantity spec price_string price_symbol price original_price section_name
0 search_product 32 B0CV5Z91QR Better Home Products Megan Wooden 6 Drawer Dou... https://m.media-amazon.com/images/I/81fL4qHZ4i... False False False False NaN NaN https://www.amazon.com/Better-Home-Products-Wo... NaN {} NaN NaN NaN NaN NaN
1 search_product 40 B0CTZQ73SH FixtureDisplays® Bathtub and Shower Tension St... https://m.media-amazon.com/images/I/51F0eGaPI+... False False False False NaN NaN https://www.amazon.com/FixtureDisplays%C2%AE-B... NaN {} $22.60 $ 22.60 NaN NaN
2 search_product 15 B0CR633W5W 4-Port USB 3.0 Hub Ultra-Thin Expand PC Connec... https://m.media-amazon.com/images/I/51sJxlimwP... False False False False NaN NaN https://www.amazon.com/4-Port-Ultra-Thin-Expan... NaN {} $3.99 $ 3.99 NaN NaN
3 search_product 15 B0CTQSPM8G MK000960GWSSD 960GB SATA 6G MU SFF SC DS SSD-1... https://m.media-amazon.com/images/I/11rdwQUxu0... False False False False NaN NaN https://www.amazon.com/Genuine-Original-MK0009... NaN {} $375.00 $ 375.00 NaN NaN
4 search_product 44 B0CJ9ZF7LQ SOL DE JANEIROCheirosa '62 Hair & Body Fragran... https://m.media-amazon.com/images/I/31mCVvwGqC... False False False False 4.6 14.0 https://www.amazon.com/SOL-JANEIRO-Cheirosa-Fr... NaN {} $62.00 $ 62.00 NaN NaN
(b) The raw data gotten from Kaggle
asin title imgUrl productURL stars reviews price listPrice category_id isBestSeller boughtInLastMonth
0 B01CR9B6UY DC Cargo E-Track Ratcheting Straps Cargo Tie-D... https://m.media-amazon.com/images/I/81KFTHU7xx... https://www.amazon.com/dp/B01CR9B6UY 4.7 593 36.99 0.00 24 False 100
1 B0BWFRDBBW Little Girls Glitter Tulle Dress, Sparkle Polk... https://m.media-amazon.com/images/I/711QwZxcW8... https://www.amazon.com/dp/B0BWFRDBBW 3.9 7 46.99 0.00 91 False 0
2 B00XV8CQGO Club Little Girl Summer Animal Collection Stic... https://m.media-amazon.com/images/I/41wQCnt7+E... https://www.amazon.com/dp/B00XV8CQGO 0.0 0 9.99 0.00 95 False 50
3 B08MBJX3K1 Toddler Boys Girls Sneakers Size 5-12 Lightwei... https://m.media-amazon.com/images/I/81IscOXT6S... https://www.amazon.com/dp/B08MBJX3K1 4.5 0 31.99 35.99 97 False 0
4 B08YKJDWLF CFS Cut to Fit Carbon Pad Pre Filter Roll for ... https://m.media-amazon.com/images/I/61j4Gpu4jH... https://www.amazon.com/dp/B08YKJDWLF 4.6 0 12.59 0.00 171 False 0

Data Cleaning

The datasets were cleaned seperately, then concatenated, then some final steps were taken to clean it.

The steps to clean the web-scaped data were:

  • Add date_scraped column
  • Remove unecessary columns: type, position, has_prime, is_amazon_choice, is_limited_deal, availability_quantity, spec, price_string, price_symbol, section_name
  • Expand and fix original_price
  • Rename columns to match standard snake case for merging both datasets
  • Drop rows with no asin or name or price
  • Drop rows with price of 0.0, since that doesn’t make sense
  • Fill NaN reviews column with 0

The steps to clean the Kaggle data were:

  • Add date_scraped column
  • Drop rows with any NaNs
  • Fix list_price of $0 to be instead equal to price
  • Change category_id to actual category by using category table
  • Drop rows with price of $0, since that doesn’t make sense
  • Rename columns to match standard snake case for merging both datasets

And then, after they were concatenated, the steps to clean were:

  • Remove duplicates (by asin + date scraped)
  • Rename columns

The final cleaned (and concatenated) dataset can be seen in Table 2 (with the original raw data in Table 1):

Table 2: The final unioned, cleaned, and processed data.
Asin Name Image Url Is Best Seller Stars Reviews Url Price Date Scraped List Price Bought In Month Category
0 B077H6599Q Generac 7103 Cold Kit for 9kW - 22kW Air Coole... https://m.media-amazon.com/images/I/71+HddWhsK... False 4.5 0.0 https://www.amazon.com/dp/B077H6599Q 94.99 2023-11-01 105.99 0.0 RV Parts & Accessories
1 B095HLD52Z 3 Pack Apple MFi Certified iPhone Charger Cabl... https://m.media-amazon.com/images/I/61rqFEt6ku... False 4.5 0.0 https://www.amazon.com/dp/B095HLD52Z 20.99 2023-11-01 20.99 0.0 Accessories & Supplies
2 B08XK4CBP8 164pcs Blush White Balloons Garland Arch Kit P... https://m.media-amazon.com/images/I/71oD9RPq84... False 4.5 0.0 https://www.amazon.com/dp/B08XK4CBP8 14.99 2023-11-01 16.99 50.0 Party Decorations
3 B0C1Z9G9J9 Mudtun Bone Conduction Earbuds for Small Ear C... https://m.media-amazon.com/images/I/51js5mISHA... False 3.5 74.0 https://www.amazon.com/dp/B0C1Z9G9J9 19.99 2023-11-01 19.99 0.0 Headphones & Earbuds
4 B0C5NB4SRN V-SHURA Music Gifts Table Lamp, Guitar Table L... https://m.media-amazon.com/images/I/710DKiJYTV... False 4.8 0.0 https://www.amazon.com/dp/B0C5NB4SRN 55.99 2023-11-01 55.99 0.0 Home Lighting & Ceiling Fans

The code for the data cleaning can be found here.

Data Preprocessing / Visualization

Various types of EDA were performed in order to examine the data; as a note, most visuals are interactive (zoomable, pannable, etc). The code for all visualizations can be found here.

Important

If the interactive figures don’t load (is blank screen space with caption showing), dont worry: just turn off all ad-blockers/privacy browsing, make sure that Chrome/Firefox is being used, and refresh the page (or close and re-open the website) until all figures load.

First, a histogram of all categories of all Amazon products is shown in Figure 1. Note scraped data did not have categories, but the Kaggle data did. It can be seen that from the figure, girl’s and boy’s clothing are the most populous categories, with toys & games the next most populous.

Figure 1: A histogram of all categories of all Amazon products. Note scraped data did not have categories, but the Kaggle data did.

Figure 2 shows a histogram of the stars recieved for a product. It can be seen that the number of stars increases generally exponentially until 4.6 stars, with peaks at 0 and 5 stars as well. This data generally agrees with what can be seen on Amazon when browsing personally, and the peaks at 0 and 5 agree with having low number of reviews.

Figure 2: Histogram of the number of stars recieved for all products.

Figure 3 shows violin plots of the numerical factors: volume of product bought in the month, the price of a product, and the number of reviews. A violin plot was chosen because a histogram would be visually uninformative, due to the amount of outliers. Some insights that can be drawn from these violin plots are:

  • These factors have a good number of outliers that are extremely large in value compared to the general distribution of values. This does not seem to be from incorrect values, but from sellers attempting to “game the system”.
  • “Bought in Month” has outliers that are regularly spaced, indicating either a form of gaming the system by sellers that tend to follow specific formats, or Amazon only reports volume of bought products at low granularity for high-volume products.
  • Price tends to have outliers that continue regularly up the price scale, except for a couple at the $60-70k range. It is unclear whether to drop those rows containing those prices as they seem valid; it is entirely possible that sellers just list the product at a completely unreasonable price as a placeholder, and don’t expect anyone to purchase at that price.
  • The number of reviews for a product have been shown to be easily gameable with fake reviews, and is likely what is happening here.

Violin plot of the number of a product bought in a month

Violin plot of price of a product

Violin plot of the number of reviews
Figure 3: Violin plots (with overlayed box plot and outliers) of 3 of the main numerical columns.

The amount of stars, vs the number of reviews for a product can be seen in Figure 4, colored with the Is Best Seller flag. It can be seen that the distribution of Is Best Seller items tends to follow the overall distribution; additionally, the number of reviews tends to increase as the number of stars increases, with the products with the outlier number of reviews (over 200k) all above 4 stars.

Figure 4: Stars vs number of reviews recieved by an amazon product, colored by whether the product was a best-seller.

In order to visualize the content of the text factors, namely the Category and Name of a product, a wordcloud format was used; wordclouds visualize the words that tend to appear in the text with the more frequent words appearing in larger text. In Figure 5, it can be seen that in the categories, “Care”, “Products”, “Boy’s Clothing”, “Consoles”, “Accessories”, and “Games” appear most. In the names of procucts, the words “Stainles Steel”, “Compatible”, “Heavy Duty”, “Boy”, “Girl”, and “Birthday” appear the most.

Wordcloud for categories of products

Wordcloud for the names of products
Figure 5: Wordclouds (where more frequent appearing words are bigger) of the categories of products and the names of products.

The number of stars vs price, with price being on a log scale (to hangle outliers) is shown in Figure 6, along with being colored by if it’s a best seller. As was seen in Figure 4, the distribution of Is Best Seller items tends to follow the overall distribution; additionally, it can be seen that there are products with a price up to $10k across all number of stars.

Figure 6: The number of stars a product recieved, vs the price in log scale, colored by whether the product was a best seller.

Investigation of products in both sets of data

The following plots focused on products that had the same Asin (identifying ID), that were in both the data personally scraped in 2024 and the Kaggle data from 2023. Investigating this subset of data could reveal a lot about how product prices, reviews, and other factors changed over time, as the only way to get time difference data was to use products that actually had more than one timepoint.

Figure 7 shows the price vs list price of items, colored by what date they were scraped on, with a trendline plotted. It can be seen that on the whole, list price and actual price match and that prices seem to have increased since 2023 to 2024; however, below $50 price (when zooming in) the trendlines cross, implying that lower-priced products actually decreased in price.

Figure 7: Price vs list price of items with the same ASIN across dates scraped, with a trendline.

Given that outliers can be seen in price affecting the plot of the graph, it was decided for analysis to only consider those prices most populous, aka prices less than $800. Given that, the prices of each Asin were plotted, along with being colored by if the price increased from 2023 to 2024, or decreased. It can be seen that in Figure 8 that indeed for lower priced products (less than $50) the prices mostly did not increase, and instead decreased or stayed the same. The histogram shows the frequency of higher-priced procucts generally exponentially decreasing, and the likelihood of the price increasing going up.

Figure 8: Histogram of prices, colored by whether the change in price increased or decreased over time, for those items that were in both sets of data.

Figure 9 shows a more granular form of the previous plot by plotting price vs the actual price difference amount from 2023 to 2024, with the ability to investigate which products are having such a dramatic increase or decrease in price over time. It can be seen that tech and suitcases were the outliers in price increase, and tech also being the outliers in price decrease.

Figure 9: Price vs the difference in price, over the two sets of data, colored by whether the price diff increased or decreased.

Finally, it was investigated as to which product categories exactly had a price increase or decrease, and the ratio of number of increases to decreases. Only categories with more than 20 products were included, to reduce noise and to focus on meaningful categories. In Figure 10, it can be seen that “Sports and Outdoors”, “Makeup”, “Electrical Equipment”, and “Building Toys” all had more price increases than decreases; every other product category had more price decreases than increases, with “Smart Homes” and “Gift Cards” having the most price decreases.

Figure 10: For the included categories, a bar chat showing the ratio of products that had their price increase over time vs decrease.
Introduction
Naive Bayes